The data set consists of:
iTRAQ proteome profiling of 77 breast cancer samples and 3 healthy samples, with expression values for ~12,000 proteins per sample.
A file containing the clinical data of the 77 breast cancer patients (TCGA ID, sex, age, tumor receptors, etc.).
A file containing the list of genes and proteins used by the PAM50 classification system.
Analysis of this data set has several potential applications: expression analysis can help identify biomarkers, characterize disease heterogeneity, and inform personalized treatment strategies in breast cancer.
Description of relevant variables
Dropped variables (redundant or not relevant): Survival.Data.Form, Days.to.date.of.Death, Days.to.Date.of.Last.Contact, OS.Time, Vital.Status, Tumor..T1.Coded, Metastasis.Coded, AJCC.Stage, Converted.Stage, and all columns intended for clustering the data.
Created variables:
Age.Ini.Diagnostic.group: intervals of 10 years starting from 30 and going up until 90.
Age.Menopausal.group: [30, 45) Pre-menopausal, [45, 55) Menopausal, [55, 90) Post-menopausal.
ER_PR_HER2: level from 0 to 7 depending on which hormonal receptors (ER, PR) are present and on the level of HER2.
TNBC: 0 if positive, 1 if negative
AJCC.Simp: simplified AJCC stages (I, II, III, and IV)
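The grouping variables above could be derived roughly as follows. This is only a sketch on toy data: the column name `age`, the example values, and the "Stage IIA"-style format of AJCC.Stage are assumptions for illustration, not taken from the clinical file. ER_PR_HER2 is left out because its 0 to 7 encoding is not spelled out here.

```r
# Toy data; column names and values are illustrative assumptions
clinical <- data.frame(
  age        = c(34, 45, 52, 55, 67, 88),
  AJCC.Stage = c("Stage I", "Stage IIA", "Stage IIB",
                 "Stage IIIA", "Stage IIIC", "Stage IV")
)

# 10-year intervals from 30 to 90, closed on the left: [30,40), [40,50), ...
clinical$Age.Ini.Diagnostic.group <- cut(clinical$age,
                                         breaks = seq(30, 90, by = 10),
                                         right = FALSE)

# [30, 45) Pre-menopausal, [45, 55) Menopausal, [55, 90) Post-menopausal
clinical$Age.Menopausal.group <- cut(clinical$age,
                                     breaks = c(30, 45, 55, 90),
                                     labels = c("Pre-menopausal",
                                                "Menopausal",
                                                "Post-menopausal"),
                                     right = FALSE)

# Drop the "Stage " prefix and the sub-stage letter (assumed to be A to C)
clinical$AJCC.Simp <- sub("^Stage (IV|I{1,3})[A-C]?$", "\\1",
                          clinical$AJCC.Stage)
```

With `right = FALSE` an age of exactly 90 would fall outside the last interval and become NA; adding `include.lowest = TRUE` to `cut()` would close the last interval on the right if that matters for the data.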
Description of relevant functions
Two functions were created to make it easy to run several comparisons on this data set.
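The kind of per-protein comparison these functions run can be sketched in base R: fit one linear model per protein against a binary condition, keep the slope and its p-value, and correct for multiple testing. The data below are simulated, and the protein IDs are made up for illustration; only the `NP`/`XP`/`YP` prefix convention and the 0/1 condition coding are borrowed from this project.

```r
set.seed(1)

# Simulated stand-in for the real table: 20 samples, a 0/1 condition,
# and three hypothetical protein columns (log2 iTRAQ ratios)
n <- 20
toy <- data.frame(
  TNBC        = rep(c(0, 1), each = n / 2),
  NP_000001.1 = rnorm(n),
  NP_000002.1 = rnorm(n),
  NP_000003.1 = rnorm(n)
)
# Shift one protein upwards in the TNBC == 1 group
toy$NP_000003.1 <- toy$NP_000003.1 + 3 * toy$TNBC

# Protein columns are identified by their RefSeq-style prefix
proteins <- grep("^(NP|XP|YP)", names(toy), value = TRUE)

# One linear model per protein; keep the slope and its p-value
fits <- lapply(proteins, function(p) {
  fit <- lm(reformulate("TNBC", response = p), data = toy)
  coef(summary(fit))["TNBC", c("Estimate", "Pr(>|t|)")]
})
estimates <- data.frame(
  Protein  = proteins,
  estimate = sapply(fits, `[[`, "Estimate"),
  p.value  = sapply(fits, `[[`, "Pr(>|t|)")
)

# Multiple-testing correction (p.adjust() defaults to Holm's method)
estimates$q.value <- p.adjust(estimates$p.value)
estimates$dif_exp <- ifelse(estimates$q.value > 0.05, "NS",
                            ifelse(estimates$estimate > 0, "Up", "Down"))
```

A positive estimate means the protein is higher, on average, in the group coded 1 than in the group coded 0.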
DEA_proteins()

    library(tidyverse)  # dplyr, tidyr, purrr, stringr, ggplot2
    library(broom)      # tidy()

    DEA_proteins <- function(data_in, condition_test){
      # Name of the condition column as a string, used to build the formula
      col_name <- deparse(substitute(condition_test))

      # Keep the protein columns plus the condition, one row per sample/protein
      data_long <- data_in |>
        dplyr::select(matches("^NP"),
                      matches("^XP"),
                      matches("^YP"),
                      {{ condition_test }}) |>
        pivot_longer(cols = -{{ condition_test }},
                     names_to = "Protein",
                     values_to = "log2_iTRAQ")

      # One nested data frame per protein
      data_long_nested <- data_long |>
        group_by(Protein) |>
        nest() |>
        ungroup()

      # Fit log2_iTRAQ ~ condition for each protein
      data_w_model <- data_long_nested |>
        group_by(Protein) |>
        mutate(model_object = map(.x = data,
                                  .f = ~lm(formula = as.formula(str_c("log2_iTRAQ ~ ", col_name)),
                                           data = .x)))

      # Extract coefficients with 95% confidence intervals
      data_w_model <- data_w_model |>
        mutate(model_object_tidy = map(.x = model_object,
                                       .f = ~tidy(.x,
                                                  conf.int = TRUE,
                                                  conf.level = 0.95)))

      # Keep the condition term only. The term name equals col_name when the
      # condition is numeric (e.g. coded 0/1).
      # Note: p.adjust() defaults to Holm's method.
      estimates <- data_w_model |>
        unnest(model_object_tidy) |>
        filter(term == col_name) |>
        ungroup() |>
        dplyr::select(Protein, p.value, estimate, conf.low, conf.high) |>
        mutate(q.value = p.adjust(p.value)) |>
        mutate(dif_exp = case_when(q.value <= 0.05 & estimate > 0 ~ "Up",
                                   q.value <= 0.05 & estimate < 0 ~ "Down",
                                   q.value > 0.05 ~ "NS"))

      plt_volcano <- volcano_plot(estimates, col_name)
      return(list(estimates = estimates, plt_volcano = plt_volcano))
    }

volcano_plot()

    volcano_plot <- function(data, condition_test){
      # data: the estimates table from DEA_proteins(); condition_test: column name as a string
      plt <- data |>
        group_by(dif_exp) |>
        # Legend labels, with the number of proteins for Up and Down
        mutate(label = case_when(dif_exp == "Up" ~ str_c(dif_exp,
                                                         " (Count: ",
                                                         n(),
                                                         ")"),
                                 dif_exp == "Down" ~ str_c(dif_exp,
                                                           " (Count: ",
                                                           n(),
                                                           ")"),
                                 dif_exp == "NS" ~ str_c(dif_exp))) |>
        ggplot(aes(x = estimate,
                   y = -log10(p.value),
                   colour = label)) +
        geom_point(alpha = 0.4,
                   shape = "circle") +
        labs(title = str_c("Differentially expressed proteins in the test: ",
                           condition_test,
                           " vs. Non-",
                           condition_test),
             subtitle = "Proteins highlighted in either red or blue were\nsignificant after multiple test correction",
             x = "Estimates",
             y = expression(-log[10]~(p)),
             color = "Differential expression") +
        # Colours follow the alphabetical order of the labels: Down, NS, Up
        scale_color_manual(values = c("blue",
                                      "grey",
                                      "red")) +
        theme_minimal() +
        theme(legend.position = "right",
              plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust = 0.5))
      return(plt)
    }

NP_002094.2: glycogen [starch] synthase, muscle isoform 1